Least Squares Temporal Difference Learning and Galerkin’s Method
Abstract
The problem of estimating the value function underlying a Markovian reward process is considered. As is well known, the value function underlying a Markovian reward process satisfies a linear fixed-point equation. One approach to learning the value function from finite data is to find a good approximation to it in a given (linear) subspace of the space of value functions. We review some of the issues that arise when following this approach, as well as some results that characterize the finite-sample performance of some of the algorithms.

1 Markovian Reward Processes

Let X be a measurable space and consider a stochastic process (X_0, R_1, X_1, R_2, X_2, ...), where X_t ∈ X and R_{t+1} ∈ R, t = 0, 1, 2, .... The process is called a Markovian reward process if
• (X_0, X_1, ...) is a Markov process, and
• for any t ≥ 0, given X_t and X_{t+1}, the distribution of R_{t+1} is independent of the history of the process.
Here, X_t is called the state of the system at time t, while R_{t+1} is the reward associated with the transition from X_t to X_{t+1}. We shall denote by P the Markov kernel underlying the process: thus, the distribution of (X_{t+1}, R_{t+1}) given X_t is P(·, ·|X_t), t = 0, 1, .... Fix the so-called discount factor 0 ≤ γ ≤ 1 and define the (total discounted) return associated with the process as

R = \sum_{t=0}^{\infty} \gamma^t R_{t+1}.
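Since the paper's subject is least-squares temporal difference (LSTD) learning of such value functions, a minimal sketch may help fix ideas. The code below is an illustration under standard assumptions rather than the paper's own algorithm or analysis: it builds the usual LSTD linear system A θ = b from a sampled trajectory with features φ(X_t) and, on a toy three-state chain with invented transition probabilities and rewards, compares the estimate against the exact solution of the linear fixed-point equation V = r + γPV. The function name `lstd`, the ridge term, and all numbers are choices made for this sketch.

```python
import numpy as np

def lstd(features, rewards, next_features, gamma, ridge=1e-6):
    """Least-squares TD estimate of theta such that V(x) ~ phi(x)^T theta.

    features:      (T, d) array whose rows are phi(X_t)
    next_features: (T, d) array whose rows are phi(X_{t+1})
    rewards:       (T,)   array of rewards R_{t+1}
    gamma:         discount factor
    ridge:         small regularizer keeping the linear system well posed
    """
    d = features.shape[1]
    # A = sum_t phi(X_t) (phi(X_t) - gamma * phi(X_{t+1}))^T  (+ ridge * I)
    A = features.T @ (features - gamma * next_features) + ridge * np.eye(d)
    # b = sum_t phi(X_t) R_{t+1}
    b = features.T @ rewards
    return np.linalg.solve(A, b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy 3-state chain MRP; all numbers are made up for the illustration.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.0, 0.9, 0.1],
                  [0.1, 0.0, 0.9]])
    r = np.array([0.0, 0.0, 1.0])   # expected reward when leaving each state
    gamma = 0.95

    # Sample a trajectory and use one-hot (tabular) features.
    T, x = 20_000, 0
    Phi, Phi_next, R = np.zeros((T, 3)), np.zeros((T, 3)), np.zeros(T)
    for t in range(T):
        x_next = rng.choice(3, p=P[x])
        Phi[t, x], Phi_next[t, x_next], R[t] = 1.0, 1.0, r[x]
        x = x_next

    theta = lstd(Phi, R, Phi_next, gamma)
    v_exact = np.linalg.solve(np.eye(3) - gamma * P, r)   # V = r + gamma P V
    print("LSTD estimate :", np.round(theta, 3))
    print("exact solution:", np.round(v_exact, 3))
```

With tabular features the LSTD estimate approaches the exact value function as the trajectory grows; with a genuine (lower-dimensional) feature subspace it instead converges to a projected fixed point, which is the approximation setting the abstract refers to.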
Similar Resources
A Non-Parametric Approach to Dynamic Programming
In this paper, we consider the problem of policy evaluation for continuous-state systems. We present a non-parametric approach to policy evaluation, which uses kernel density estimation to represent the system. The true form of the value function for this model can be determined, and can be computed using Galerkin's method. Furthermore, we present a unified view of several well-known policy...
Sustainable ℓ2-regularized actor-critic based on recursive least-squares temporal difference learning
Least-squares temporal difference learning (LSTD) has been used mainly for improving the data efficiency of the critic in actor-critic (AC) methods. However, convergence analysis of the resulting algorithms is difficult when the policy is changing. In this paper, a new AC method is proposed based on LSTD under the discounted criterion. The method's contribution comprises two components: (1) LSTD works in an ...
Ensembles of extreme learning machine networks for value prediction
Value prediction is an important subproblem of several reinforcement learning (RL) algorithms. In a previous work, it has been shown that the combination of least-squares temporal-difference learning with ELM (extreme learning machine) networks is a powerful method for value prediction in continuous-state problems. This work proposes the use of ensembles to improve the approximation capabilitie...
Kernel Least-Squares Temporal Difference Learning
Kernel methods have recently attracted much research interest, since by utilizing Mercer kernels, non-linear and non-parametric versions of conventional supervised or unsupervised learning algorithms can be implemented, usually with better generalization. However, kernel methods in reinforcement learning have not been widely studied in the literature. In this paper, w...
Locally Weighted Least Squares Temporal Difference Learning
This paper introduces locally weighted temporal difference learning for the evaluation of a class of policies whose value function is nonlinear in the state. Least-squares temporal difference learning is used to train local models according to a distance metric in state space. Empirical evaluations are reported demonstrating learning performance on a number of strongly non-linear value function...
Publication year: 2011